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Abstract 



We present a Hamiltonian Monte Carlo algorithm to sample from multivariate Gaussian distri- 
butions in which the target space is constrained by linear and quadratic inequalities or products 
thereof. The Hamiltonian equations of motion can be integrated exactly and there are no parame- 
ters to tune. The algorithm mixes fast and outperforms Gibbs sampling for constraint geometries 
j>. that impose strong correlations among the variables. The runtime scales linearly with the num- 

ber of constraints but the algorithm is highly parallelizable. A simple extension of the algorithm 
permits sampling from distributions whose log-density is piecewise quadratic, as in the "Bayesian 
■^f lasso" model. 
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1 Introduction 



The advent of Markov Chain Monte Carlo methods has made it possible to sample from complex 



multivariate probability distributions Robert and Casella, 2004 , leading to a remarkable progress 



in Bayesian modeling, with applications to many areas of applied statistics and machine learning 
[Gelman et~al] |2004 

and the 



1992 



In many cases, the data or the parameter space are constrained Gelfand et al. , 
need arises for efficient sampling techniques for truncated distributions. In this paper we will focus 
on the Truncated Multivariate Gaussian (TMG), a (f-dimensional multivariate Gaussian distribution 
of the form 



logp(X) oc -lx T MX + r T X 

with X, r 6 M. d and M positive definite, subject to m inequalities 

Qj(X)>0 j = l,...,m, 



(1.1) 



(1.2) 



where Qj(X) is a product of linear and quadratic polynomials. These distributions play a central 



role in models as diverse as the Probit and Tobit models Albert and Chib 


1993. 


Tobiri 


l 1958 , the 


dichotomized Gaussian model Emrich and Piedmonte 


1991, 


Cox and Wermuth 


2002 


, stochastic 


integrate-and-fire neural models Paninski et al. 


2004 


, Bayesian isotonic regression 


Neelon and 


Dunson, 2004 , the Bayesian bridge model expressed as 


a mixture of Bartlett-Fejer kernels [Poison 



and Scott 2011 , and many others. 



1991 , which 



The standard approach to sample from TMGs is the Gibbs sampler Geweke 
reduces the problem to one-dimensional truncated Gaussians, for which simple and efficient sampling 
methods exist [Robert, 1995, Damien and Walker 2001 . While it enjoys the benefit of having no 



parameters to tune, the Gibbs sampler can suffer from two problems, which make it inefficient in 
some situations. Firstly, its run-time scales linearly with the number of dimensions. Secondly, 



even though a change of variables that maps M in (1.1) to the identity often improves the mixing 



speed | Rodriguez- Yam et al. 2004], the exploration of the target space can still be very slow when 



the constraints (1.2) impose high correlations among the coordinates. Figure [T] illustrates this effect 
in a simple example. Improvement over the Gibbs run-time can be obtained with a hit-and-run 
algorithm Chen and Deely, 1992 , but the latter suffers from the same slow convergence problem 



when the constraints impose strong correlations. 

In this paper we present an alternative algorithm to sample from TMG distributions for con- 



straints Qj(X) in (1.2) given by linear or quadratic functions or products thereof, based on the 
Hamiltonian Monte Carlo (HMC) approach. The HMC method, introduced in Duane et al. 1987|, 



considers the log of the probability distribution as minus the potential energy of a particle, and in- 
troduces a Gaussian distribution for momentum variables in order to define a Hamiltonian function. 
The method generally avoids random walks and mixes faster than Gibbs or Metropolis-Hastings 
techniques. The HMC sampling procedure alternates between sampling the Gaussian momenta and 
letting the position of the particle evolve by integrating its Hamiltonian equations of motion. In 
most models, the latter cannot be integrated exactly, so the resulting position is used as a Metropo- 
lis proposal, with an acceptance probability that depends exponentially on the energy gained due 
to the numerical error. The downside is that two parameters must be fine-tuned for the algorithm 
to work properly: the integration time-step size and the number of time-steps. In general the 



1 



values selected correspond to a compromise between a high acceptance rate and a good rate of 

More details of HMC can be found in the 



exploration of the space Hoffman and Gelman, 2011 
reviews by [Kennedy] |1990| and |Neal| |2010 



The case we consider in this work is special because the Hamiltonian equations of motion can be 
integrated exactly, thus leading to the best of both worlds: HMC mixes fast and, as in Gibbs, there 
are no parameters to tune and the Metropolis step always accepts (because the energy is conserved 
exactly). The truncations (1.2) are incorporated via hard walls, against which the particle bounces 
off elastically. The run-time depends highly on the shape and location of the truncation, as most 
of the computing time goes into finding the time of the next wall bounce and the direction of the 
reflected particle. But unlike the Gibbs sampler, these computations are parallelizable, potentially 
allowing fast implementations. 

The discontinuity that a particle experiences when bouncing off a constraint wall is similar when 
the log-density is piecewise quadratic. We show that a simple extension of the algorithm allows us 
to sample from such distributions, focusing on the example of the "Bayesian lasso" model Park 



and Casella, 2008 



Previous HMC applications that made use of exactly solvable Hamiltonian equations include 

and importance 



sampling from non-trivial integrable Hamiltonians Kennedy and Bitar 1994 



sampling, with the target distribution approximated by a distribution with an integrable Hamilto- 
nian | Ra smussen 2003 , Izaguirre and Hampton[ |2004| . 

In the next section we present the new sampling algorithm for linear and quadratic constraints 
along with two example applications; in Section [3] we present the extension to the Bayesian lasso 
model. We have implemented the sampling algorithm in the R package "tmg." 



2 The Sampling Algorithm 

2.1 Linear Inequalities 

Consider first sampling from 



logp(X)oc-ix-X (2.1) 

subject to 

F r X + #>0 j = l,...,m, (2.2) 



Any quadratic form for logp(X), as in ( 1.1 ), can be brought to the above canonical form by a linear 
change of variables. Let us denote the components of X and Fj as 

X = (xi,...,x d ) , (2.3) 

Fi - Uj /;')• (2.4) 

In order to apply the HMC method, we introduce momentum variables II, 

n=( 7 r 1 ,..., 7 r d ), (2.5) 

and consider the Hamiltonian 

#=-X-X + -II-II, (2.6) 
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Exact HMC Gibbs sampler 




Iteration Iteration 

Figure 1: HMC vs Gibbs sampler. Comparison for a two-dimensional distribution with 
logp(x,y) oc -7j(x - 4) 2 - 7j(y - 4) 2 , constrained to the wedge x < y < l.lx and x,y > 0. The ini- 
tial point is (x,y) = (2,2.1). Upper panels: first 20 iterations. Lower panels: Second coordinate of the 
first 400 iterations. The exact HMC sampler moves rapidly to oscillate around y = 4, as desired, while 
the Gibbs sampler mixes relatively slowly. 



such that the joint distribution is p(X, II) = exp(-H). The equations of motion following from (2.6) 
are 

Xi - §^=7r* (2.7) 
air 1 
a it 

7T* = -— = -Xi i = l,...,d (2.8) 



dxi 



which can be combined to 



and have a solution 



%i ~ X{ , (2.£ 



Xi(t) = ajsin(t) + 6jCos(t) . (2-10) 
The constants a, , b{ can be expressed in terms of the initial conditions as 

h = Xi (0) (2.11) 

di = £;(())= 7T;(0) (2.12) 

The HMC algorithm proceeds by alternating between two steps. In the first step we sample II 
from j>(II|X) = p(H) = 7^(0,1^). In the second step we use this II and the last value of X as initial 



3 



conditions, and let the particle move during a time T. The value of X at the end of the trajectory 
belongs to a Markov chain with equilibrium distribution p(X). The value of T is arbitrary; one 
reasonable approach would be to choose the T that leads to the farthest final distance from the 
starting point, and thus to a fast exploration of the target space 



Hoffman and German 2011 



However, this value is not easy to compute when the particle is being reflected off of many walls. 
Therefore we simply sample T uniformly from [0,7r] at each iteration (recall that the unconstrained 
trajectories are sinusoidal with period 27r). Another option would be to just set T - n. 



The trajectory of the particle is given by (2.10) until it hits a wall, and this occurs whenever 



any of the inequalities (2.2) is saturated. To find the time at which this occurs, it is convenient to 
define 



= E/j x *(*) + * j = l,...,m. 

d d 

= fj a i sin (*) + E ffa cos (*) + 9j 



L=l 



1=1 



= u 



j cos(t + <pj) + gj 



where 



U ; 



\ 



(£/}<x) 2 + (EJ?W 2 . 



i=l 



i=l 



tan (fj 



(2.13) 

(2.14) 
(2.15) 

(2.16) 
(2.17) 



Along the trajectory we have Kj(t) > for all j and a wall hit corresponds to Kj(t) = 0, so from 



(2.15) it follows that the particle can only reach those walls satisfying Uj > \gj\. Each one of those 



reachable walls has associated two times tj > such that 



(2.18) 



and the actual wall hit corresponds to the smallest of all these times. Suppose that the latter occurs 
for j = h. At the hitting point, the particle bounces off the wall and the trajectory continues with a 
reflected velocity. The latter can be obtained by noting that the vector is perpendicular to the 
reflecting plane. Let us decompose the velocity as 



X(t h ) = X L (t h )+a h F h , 



where ■ X 1 (t/ l ) = and 



ah 



F fc -X(i fc ) 



(2.19) 



(2.20) 



The reflected velocity, X^(t/j), is obtained by inverting the component perpendicular to the reflect- 
ing plane 



X R (t h ) = X. ± (t h ) -a h F h , 
= X(t h ) - 2a h F h . 



(2.21) 
(2.22) 
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It is easy to verify that this transformation leaves the Hamiltonian ( 2.6 ) invariant. Once the reflected 



velocity is computed, we use it as an initial condition in (2.12) to continue the particle trajectory 



The run-time of each iteration scales linearly with m, since we have to compute the m values u 



•31 



and, when Uj > \gj\, we need to compute <pj and tj, defined in (2.17) and (2.18). (Of course, the total 



runtime is also proportional to the number of times the particle hits the wall per iteration, which 
varies according to the shape and location of the walls.) The dominant cost is in the computation of 



the sums in expressions (2.16) and (2.17): these can be interpreted as matrix-vector multiplications, 
with cost 0(md) for general constraint matrices F = (F^Fj . . . F^ n ) T . Note that these matrix- 
vector multiplications are highly parallelizable. In addition, in many cases there may be some 
special structure that can be exploited to speed computation further; for example, if F can be 
expressed as a sparse matrix in a convenient basis, this cost can be reduced to 0(d). Note that the 



transformation of a general quadratic form for logp(X), as in (1.1), to the canonical form (2.1) is not 
always computationally efficient, because the constraints also change under the transformation and 
a sparse constraint in the original frame may became dense in the whitened frame. This situation 
occurs in the examples below and in Section [3} For these cases, it can be convenient to keep the 
original distribution in the form (1.1) and consider the Hamiltonian 



H = -X T MX - r T X + -n T M _1 n . 



(2.23) 



As we show in Section [3j such a mass matrix for the momenta also leads to independent trigonomet- 
ric solutions for each coordinate. But the time saved in the fast evaluation of the constraints has a 
trade-off in that now, at each iteration, the momenta II should be sampled from the distribution 
A/"(0,M), with a non-trivial covariance. Again, it is often possible to exploit structure in M to 
speed up computation (e.g., via specialized Cholesky decomposition approaches). 



2.2 Quadratic and Higher Order Inequalities 

The sampling algorithm can be extended in principle to polynomial constraints of the form 

Qi(X)>0 j = l,...,m. (2.24) 



Evaluating Qj(X) at the solution (2.10) leads to a polynomial in sin(t) and cos(t), whose zeros 
must be found in order to find the hitting times. When a wall is hit, we reflect the velocity by 
inverting the sign of the component perpendicular to the wall, given by the gradient VQj(X). 
This vector plays a role similar to Fj in ( 2.19 )-( 2.22). Of course, for general polynomials Qj(X) 
computing the hitting times might be numerically challenging. 

One family of solvable constraints involves quadratic inequalities of the form 



where Aj e \ 



t>d,d 



B,6 



Qj(X) = X' AjX + X • Bj + Cj > j = l,...,m, (2.25) 
For statistics applications where these constraints are important, 



see e.g. Ellis and Maitra 2007 . Inserting (2.10) in the equality for (2.25) leads, for each j, to the 

(2.26) 



following equation for the hitting time: 

qi cos 2 (t) + q2 cos(t) + #3 = - sin(t)(g4 cos(t) + <&) , 
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4 



3 - 





Figure 2: Truncation by quadratic inequalities. Above: 6000 samples of a two-dimensional 



canonical normal distribution, constrained by the quadratic inequalities (2.41 ) -(2.4-2). The piecewise 



elliptic curve shows the trajectory of the particle in the first iterations, with starting point (x, y) = (2,0). 



Below: first 800 iterations of the vertical coordinate. For the algebraic solution of (2.35), we used the 



C++ code from the DynamO package Bannerman et al, 20111 
with 



91 


i,k ik 


(2.27) 


92 : 


- ££%, 

i 


(2.28) 


93 : 


-- C + E^ajafc, 

ik 


(2.29) 


94 : 


- 2Y,A ik ai b k , 

i.k 


(2.30) 


95 : 


- E^ 6 


(2.31) 



and we omitted the j dependence to simplify the notation. If the ellipse in (2.25) is centered at the 
origin, we have Bj = qi = q§ = 0, and equation (2.26) simplifies to 



qi + 2^3 + u sin(2t + ip) = 



where 



u 

tamp 



2 2 
Ql + <?4 ■ 

qi 

94 ' 



(2.32) 

(2.33) 
(2.34) 



and the hit time can be found from (2.32) as in the linear case. In the general B ? t case, the 



square of (2.26) gives the quartic equation 

r4 COS (t) + T3 COS 3 (t) + T2 COS 2 (t) + T\ COs(t) + Tq = , 



where 



2 2 



rz = 2<?i<? 2 + 29495 , 

r 2 = q\ + 2q x q 3 + ql~q\, 

n = 2g 2 <?3 - 2^495 , 



(2.35) 



(2.36) 
(2.37) 
(2.38) 
(2.39) 
(2.40) 



Equation (2.35) can be solved exactly using standard algebraic methods |Herbison- Evans 1994] . A 
wall hit corresponds, among all the constraints j, to the solution for cos(t) with smallest t > and 
|cos(i)| < 1, which also solves (2.26). As an example, Figure [2] shows samples from a two-dimensional 
canonical normal distribution, constrained by 



(x-4) 2 {y-lf 



32 



< 1 



Ax 2 + 8y 2 - 2xy + 5y > 1 . 



(2.41) 
(2.42) 



Equipped with the results for linear and quadratic constraints, we can also find the hitting times 
for constraints of the form 



Q(x) = nQ;( x )^° 

j 



(2.43) 



where each Qj(X) is a linear or a quadratic function. Each factor defines an equation as (2.18) or 



(2.35 ), and the hitting time is the smallest at which any factor becomes zero. For other polynomials, 



one has to resort to numerical methods to find the hitting times. 



2.3 Example: Probit and Tobit Models 

The Probit model is a popular discriminative probabilistic model for binary classification with 
continuous inputs |Albert and Chib 1993| . The conditional probabilities for the binary labels 
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y - ±1 are given by 



\/ ZTT J-oo 

J- f 

p(y = +l|z,/3) = l-$(z-/3) 



dit; e 2 



2?r 

(iu+z-/3) 2 

. dwe 2 

V27T J-oo 



— f 

J2tt Jo 



+ °° (W + Z-/3) 2 

dw e 2 



(2.44) 

(2.45) 
(2.46) 
(2.47) 



where z e M p is a vector of regressors and /3 e MP are the parameters of the model. Given N pairs 
of labels and regressors 



Y= (yi,...,y N ), 
Z = (zi, . . . ,zjv) , 



the posterior distribution of the parameters (3 is 



JV 



p(/3|Y,Z) oc p(p)Hp(yiKP) 



8=1 



p(/3) / <iw 



i=l...JV 



where is the prior distribution. The likelihood p(j/j|zj,/3) corresponds to a model 

j/i = sign{wi) 
Wi = -Zi- (3 + Si 
e l ~ A/"(0,1) 



(2.48) 
(2.49) 



(2.50) 
(2.51) 



(2.52) 
(2.53) 
(2.54) 



in which only the sign of Wi is observed, but not its value. Assuming a Gaussian prior with zero 
mean and covariance a 2 I p , expression (2.51) is the marginal distribution of a multivariate Gaussian 
on (/3, vj\, . . . wn), truncated to yiWi > for % = 1, . . . , N. The untruncated Gaussian has zero mean 
and precision matrix 



]l = (M ww M wf3 \ 
\Mf3 W Mpp) 



pN+p,N+p 



where 



M ww 
M w/3 



In 



M, 



T 

f3w 



6 

/zi\ 

W/ 

JV 



AT, AT 



pN,p 



M, 



13(3 



(2.55) 

(2.56) 
(2.57) 

(2.58) 



We can sample from the posterior in (2.51 ) by sampling from the truncated Gaussian for (/3, w\, . . . wn) 
and keeping only the (3 values. It is easy to show that without the first term in (2.58), coming from 
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Exact HMC 




Gibbs Sampler 





50 100 150 200 



50 100 150 200 




1000 2000 3000 4000 5000 




1000 2000 3000 4000 5000 



Figure 3: Bayesian Probit model. First 200 and 5000 samples from the posterior (2.51) of a model 
with p = 3. N = 800 pairs (yi,Zi) were generated with (3\ = -9,/?2 = 20, /?3 = 27 and we assumed a 
Gaussian prior with zero mean and a 2 = 10 5 . Note that the means of the sampled values are different 
from the values used to generate the data, due to the zero-mean prior. Left: Exact HMC sampler. Right: 



Gibbs sampler, with whitened covariance to improve mixing Rodriguez-Yam et al, 200^ 



the prior p{j3), the precision matrix would have p null directions and our method would not be 
applicable, since we assume the precision matrix to be positive definite. Note that the dimension of 
the TMG grows linearly with the number N of data points. As an illustration, Figure [3] shows the 
values of /3, sampled using Gibbs and exact HMC, from the posterior of a model with p = 3 where 
N = 800 data points were generated. We used z\ - 1, z 2 ~ J7ra/[-5, +5] and zf ~ A/"(-4, a = 4). 
The values of yi were generated with f3\ = -9,02 = 20, 03 = 27 and we assumed a Gaussian prior 
with a 2 = 10 5 . Note that the means of the sampled /3j's are different from the /3j's used to generate 
the data, due to the prior which pulls the /3j's towards zero. For both samplers, we made a coor- 
dinate rotation to the canonical frame in which the unconstrained Gaussian has unit covariance. 
This transformation changes the constraint surface and imposes correlations that make Gibbs mix 
slowly, as shown in Figure [3| One can similarly consider the multivariate Probit model |Ashford" 



and Sowden, 1970 , where the Bayesian approach has been shown to be superior to Maximum 



Likelihood Geweke et al. 1994 . 
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A related model is the Tobit model for censored data [Tobin 1958 , which is a linear regression 
model where negative values are not observed: 



y% = 



y* fory*>0. 
fory;<0. 



where 

y* = Zi-(3 + ei, ei~N(0,a). 
The likelihood of a pair (yi,Zj) is 



p(yi\zi,(3,a) = 



(vi-*i-PY 



i r° 



for?/, > 0. 



— r f ~, dw;e 2 CT 2 for?/j = . 
I V2tvo- 2 J -°° y 

and the posterior probability for (3 is 



N 



p((3\Y,Z,a) oc p (f3\a) UpiViKM 



i=l 

i, Vi >0 i,y i =0 J -° c 



(2.59) 



(2.60) 



(2.61) 



(2.62) 
(2.63) 



As in (2.51), this can be treated as a marginal distribution over the variables Wi, with the joint 
distribution for ((3,Wi) a truncated multivariate Gaussian. 



2.4 Example: Bayesian splines for positive functions 

Suppose we have noisy samples {y%,Xi),i = 1...N, from an unknown smooth positive function 
f(x) > 0, with x e [0,/i]. We can estimate f(x) using cubic splines with knots at the x%'s, plus 
and h Green and Silverman, 1994| . The dimension of the vector space of cubic splines with N 
inner knots is N + 4. Our model is thus 



y, 



AT+4 
s=l 



1...N. 



(2.64) 



where the functions <fi s (x) are a spline basis. Suppose we are interested in the value of f(x) at the 
points x = Zj with j = 1 . . . m. To enforce f(x) > at those points, we impose the constraints 



<f>(zj) • a > , 



3 = 1 • • 



, m . 



where 



4>(x) = (<f>i(x),...,(f>N+4(x)), 
a = (01,. . . jOjv+4) . 



(2.65) 

(2.66) 
(2.67) 



To obtain a sparse constraint matrix, it is convenient to use the B-spline basis, in which only four 
elements in the vector 4>(zj) are non-zero for any j (see, e.g. |De Boor 2001 for details). In a 
Bayesian approach, we are interested in sampling from the posterior distribution 



p(a, ct 2 |Y, X, A) oc p(Y|X, a, a 2 )p(a.\X, o 2 )p(o 2 ) 



(2.68) 
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Figure 4: Bayesian splines for positive functions. The crosses show 50 samples from yi - 
Xi sin 2 (xj) + where e, ~ J\f(0,a) with a 2 = .09. The values of xi were sampled uniformly from [0,2tt]. 
The curve f(x) = x sin 2 (x) is shown as a dashed line. The shaded band shows the splines built with 



coefficients from the .25 and .75 quantiles o f samples from the posterior distribution of a in (2.68). We 
used a Jeffreys prior for a 2 'Jeffreys, 194&1 and imposed the positivity constraints (2.65) at 100 points 
spread uniformly in [0,27r]. The smoothness parameter A was estimated as X - 0.0067 by maximizing 
the marginal likelihood (empirical Bayes criterion), using a Monte Carlo EM algorithm. The mean of 
the samples of a 2 was a 2 = 0.091. The spline computations were performed with the "fda" MATLAB 
package Ramsay et al. , 2003] . 



where we defined 



Y = (y u ...,y N ), 
X = (xi,...,x N ). 



The likelihood is 



1 / 1 N \ 

p(Y\X, a ,a 2 )= {27r(j2)N/2 exp g(y, - a • cf>( Xi )) 2 j , 

and for the prior on a we consider 

p(a|A,cr 2 ) oc ^ — j eX p^___^ dx (a- 4>"(x)f^j , 



N+4, 

X {£) 2 eXP ("2^ aTKa )' 



(2.69) 
(2.70) 



(2.71) 

(2.72) 
(2.73) 
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where K e . 



5 Af+4,Af+4 



has entries 



rh r/ 

= J dx (j)g(x)(j)['.(x) . 



(2.74) 



The prior ( 2.72 )-( 2.73) is standard in the spline literature and imposes a A-dependent penalty on 
the roughness of the estimated polynomial, with a bigger A corresponding to a smoother solution. 
This penalty allows to avoid overfitting the data |Green and Silverman 1994| . 

We can Gibbs sample from the posterior (2.68) by alternating between the conditional distribu- 
tions of a 2 and a. The latter is a TMG with 



logp(a|a 2 ,X,Y,A)oc 



1 

2^ 



a T (M + AK)a + — • r , s = l...N + A. 



1 



constrained by (2.65), and we defined 



N 

M = 2>(si)0(xi) 5 

N 



D iV+4,iV+4 



,7V+4 



(2.75) 

(2.76) 
(2.77) 



Figure |4] shows an example for the function f(x) = x sin 2 (:r), with N - 50 points sampled as 

yi = Xi sm 2 (xi) + ei £;~.Af(0, a) a 2 = .09 , (2.78) 
and with the Xi sampled uniformly from [0,2-7r]. 



3 The Bayesian Lasso 



The techniques introduced above can also be used to sample from multivariate distributions whose 
log density is piecewise quadratic, with linear or elliptical boundaries between the piecewise regions. 
Instead of presenting the most general case, let us elaborate the details for the example of the 



Bayesian lasso Park and Casella, 2008 Hans, 2009, Poison and Scott, 2011 



We are interested in the posterior distribution of the coefficients (3 € ~K d and a 2 of a linear 
regression model 



e n ~Af(0,a) 



71 



1. 



,N. 



Defining 



Y = (yi,...,y N ) 
Z = (zi, . . . ,z N ) , 

we want to sample from the posterior distribution 

p(/3, a 2 \Y, Z, A) oc p(Y|Z, (3, a 2 )p(f3\X, a 2 )p(a 2 ) , 

with prior density for the coefficients 



p(P\X,a 2 ) = 



\2a 2 



(3.1) 



(3.2) 
(3.3) 



(3.4) 



(3.5) 
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This prior is called the lasso (for 'least absolute shrinkage and selection operator') and imposes a 
A-dependent sparsening penalty in the maximum likelihood solutions for /3 |Tibshirani 1996| . 

We can Gibbs sample from the posterior (3.4) by alternating between the conditional distribu- 
tions of a 2 and (3. The latter is given by 



N 



A 



logp(f3\Y,Z,a 2 ,X) = -^£(z n ./3-y n ) 2 + -£|A 

l0 ~ n=l °~ i=l 



2o-' 



a 



i=l 



where we defined 



with 



N 

n=l 

AT 

^ l ( s i) = E Vn{zi)n - As, 
n=l 



s» = sign(/3j) 



i = l. 



(3.6) 
(3.7) 

(3.8) 
(3.9) 

(3.10) 



Sampling f3 from (3.7) was considered previously via Gibbs sampling, either expressing the Laplace 



prior (3.5) as mixtures of Gaussians [Park and Casella, 2008 or Bartlett-Fejer kernels Poison and 



Scott 



2011 , or directly from (3.7) [Hans 



2009 



In order to apply Hamiltonian Monte Carlo we consider the Hamiltonian 

d J2 



H = -^/3 T M/3 - ^ £ L\s t )f3 t + ^-II T M . 
la a i=1 2 



(3.11) 



Note that we did not map the coordinates to a canonical frame, as in Section 2.1. Instead, we chose 
a momenta mass matrix <r~ 2 M, which is equal to the precision matrix of the coordinates. This 
choice leads to the simple equations 



h = ~Pi + Mi( s ) , 



(3.12) 



where 



The solution to (3.12) is 



where 



Pi(t) = Mi(s) + di sin(t) + 6j cos(t) 
= m(s) + AiCos(t + <pi), 



A; 



tan tpi 



a i 



(3.13) 



(3.14) 
(3.15) 



(3.16) 
(3.17) 
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The constants ai, b{ in (3.14) can be expressed in terms of the initial conditions as 



h = ft(0)-/ii(s) 
= M^V'(O). 



(3.18) 
(3.19) 
(3.20) 



As in Section 



2.1 



we start by sampling II from p(II|/3) = p(H) = A/"(0, <7~ 2 M) and let the particle 
move during a time T sampled uniformly from [0,7r]. The trajectory of the particle is given by 
(3.15) until a coordinate crosses any of the = planes, which happens at the smallest time t > 
such that 



= jUj(s) + Ai cos(t + 



» = 1, 



,d, 



(3.21) 



(Note that had we transformed the coordinates j3 to a canonical frame, each condition here would 
have involved a sum of <i terms; thus the parameterization we use here leads to sparser, and therefore 
faster, computations.) Suppose the constraint is met for i = j at time t = tj. At this point j3j changes 



sign, so the Hamiltonian (3.11) changes by replacing 



(3.22) 



which in turn changes the values of ;Uj(s)'s in ( |3.13| ). Note from ( |3.12 ) that this causes a jump in 
(3(tj). Using the continuity of (3(tj), (3{tj) and the updated /Uj(s)'s, we can compute new values 
for ai and bi as in (3.18) and (3.19) to continue the trajectory for times t> tj. 

The piecewise linear log-density (3.6) is continuous with discontinuous derivative, but we can also 



consider discontinuous log-densities defined piecewise. In these cases, the velocity is not continuous 
across the boundary of two regions, but jumps in such a way that the total energy is conserved. The 
extension of the basic method to this case is straightforward. Also, combining this algorithm with 
the imposition of constraints of the previous section, the HMC technique can be used to sample 
the posterior of a lasso model with additional constraints on the /3j's, such as the tree shrinkage 



model |LeBlanc and Tibshirani 1998 or the hierarchical lasso [Bien et ah 2012] . 
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